PCT

WORLD INTELLECTUAL PROPERTY ORGANIZATION
International Bureau

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification 5: G06F 15/80

A1

(11) International Publication Number: WO 92/03802

(43) International Publication Date: 5 March 1992 (05.03.92)



(21) International Application Number: PCT/GB91/01390 

(22) International Filing Date: 15 August 1991 (15.08.91) 



(30) Priority data: 9018048.0, 16 August 1990 (16.08.90), GB



(71) Applicant (for all designated States except US): THE SECRETARY OF STATE
FOR DEFENCE IN HER BRITANNIC MAJESTY'S GOVERNMENT OF THE UNITED KINGDOM OF
GREAT BRITAIN AND NORTHERN IRELAND [GB/GB]; Whitehall, London SW1A 2HB (GB).

(72) Inventors; and 

(75) Inventors/Applicants (for US only): JOHNSON, Martin [GB/GB]; 11
Fruitlands, Malvern Wells, Worcestershire WR14 4AH (GB). JONES, Robin
[GB/GB]; 25 Sandpiper Crescent, Malvern Link, Worcestershire WR14 IVY (GB).
BROOMHEAD, David, Sidney [GB/GB]; 3 Assarts Road, Malvern Wells,
Worcestershire WR14 4HW (GB).



(74) Agent: BECKHAM, Robert, William; IPD/DRA, Room 
2016, Empress State Building, Lillie Road, London SW6 
1TR (GB). 



(81) Designated States: AT (European patent), BE (European patent), CH
(European patent), DE (European patent), DK (European patent), ES (European
patent), FR (European patent), GB, GB (European patent), GR (European
patent), IT (European patent), JP, LU (European patent), NL (European
patent), SE (European patent), US.



Published

With international search report. 



(54) Title: DIGITAL PROCESSOR FOR SIMULATING OPERATION OF A PARALLEL PROCESSING ARRAY 



(57) Abstract 



A digital processor for simulating operation of a parallel processing array
incorporates digital processing units (P1 to P8) communicating data to one
another via addresses in memories (M0 to M8) and registers (R11 to R41). Each
processing unit (e.g. P1) is programmed to input data and execute a
computation involving updating of a stored coefficient followed by data
output. Each computation involves use of a respective set of data addresses
for data input and output, and each processing unit (e.g. P1) is programmed
with a list of such sets employed in succession by that unit. On reaching the
end of its list, the processing unit (e.g. P1) repeats it. Each address set is
associated with a conceptual internal cell location in the simulated array
(10), and each list is associated with a respective sub-array of the simulated
array (10). Data is input cyclically to the processor (40) via input/output
ports (I/O5 to I/O8) of some of the processing units (P5 to P8). Each
processing unit (e.g. P1) executes its list of address sets within a cycle at
a rate of one address set per subcycle. At the end of its list, each of the
processing units (P1 to P8) has executed the functions associated with a
conceptual respective sub-array of simulated cells (12), and the processor
(40) as a whole has simulated operation of one cycle of a systolic array (10).
Repeating the address set lists with further processor input provides
successive simulated array cycles.

DIGITAL PROCESSOR FOR SIMULATING OPERATION OF A PARALLEL
PROCESSING ARRAY

This invention relates to a digital processor for simulating operation of a
parallel processing array, such as a systolic array.

The field of parallel processing arrays was developed to overcome a well-known
problem in conventional digital computers, the "Von Neumann bottleneck". This
problem arises from the serial nature of conventional computers, in which
programme steps or instructions are executed one at a time and in succession.
This means that the computer operating speed is restricted to the rate at which
its central processing unit executes individual instructions.

To overcome the operating speed problem of conventional computers, parallel
processors based on systolic array architectures have been developed. One such
is disclosed in British Patent No. GB 2,151,378B, which corresponds to United
States Patent No. 4,727,503. It consists of a triangular array of internal and
boundary cells. The boundary cells form the array diagonal and are
interconnected via delay latches. The internal cells are in above-diagonal
locations. The array includes nearest-neighbour cell interconnection lines
defining rows and columns of cells. The cells are activated cyclically by a
common system clock. Signal flow is along the rows and down the columns at the
rate of one cell per clock cycle. Each cell executes a computational function
on each clock cycle employing data input to the array and/or received from
neighbouring cells. Computation results are output to neighbouring cells to
provide input for subsequent computations. The computations of individual
cells are comparatively simple, but the systolic array as a whole performs a
much more complex calculation, and does so in a recursive manner at
potentially high speed. In effect, the array subdivides the complex
calculation into a series of much smaller cascaded calculations which are
distributed over the array processing cells. An external control computer is
not required. The cells are clock-activated, and each operates on every clock
cycle. The maximum clock frequency or rate of processing is limited only by
the rate at which the slowest individual cell can carry out its comparatively
simple processing function. This results in a high degree of parallelism, with
potentially high speed if fast processing cells are employed. The "bottleneck"
of conventional computers is avoided.

The disadvantage of prior art systolic arrays is that, in all but the
simplest problems, large numbers of cells are required. As will be
described later in more detail, a prior art triangular array for dealing
with an n-dimensional computation requires in the order of $n^2/2$ internal
cells. In consequence, the number of internal cells required grows as
the square of the number of dimensions of the computation. The number
of boundary cells grows only linearly with the number of dimensions. One
important application of a triangular systolic array relates to
processing signals from an array of sensors, such as a phased array of
radar antennas. Typical radar phased arrays incorporate in the
region of one thousand or more antennas, and a systolic array to process
the antenna signals would require of the order of one million
processing cells. Each cell would require the processing functions and
connectivity capabilities of a transputer to enable communications
between neighbouring cells. Special purpose integrated circuits could
also be used, in which "cells" constitute respective areas of a silicon
chip or wafer. Since transputers are priced in excess of £100 each,
a systolic array would be prohibitively expensive for radar
phased array purposes. It would also be prohibitively expensive for many
other signal processing applications characterised by high
dimensionality.

There is a need for digital processing apparatus which has a degree of 
parallelism to overcome conventional computer disadvantages, but which 
requires fewer processing cells than a prior art systolic array. 

It is known from EP-A-0 021 404 to employ an array of specially
designed processors in a computer system for the simulation of logic
operations. These processors operate in parallel. However, this prior
art parallel array is disadvantageous in that data flow through it
requires a multi-way switch operated by a computer. For i processors,
the switch is i-by-i-way so that each processor can be connected to each
of the others under computer control. This is not compatible with a
systolic array architecture, in which (a) there is no controlling
computer, (b) data flow paths in the array are fixed, (c) data flow is
between nearest neighbours, (d) there are no external control
instructions, and (e) conventional general purpose processors (eg
transputers) may be used with programming to execute fairly
straightforward arithmetic functions. Indeed, a major objective of
systolic array architectures is to avoid the need for a controlling
computer.

US Patent No. 4,622,632 to Tanimoto et al. relates to a pattern
matching device which employs arrays of processors for operating on
pyramidal data structures. Here the processors operate under the
control of what is said to be a "controller", by which is presumably
meant a control computer. The controller provides instructions to each
of the processors in synchrony. The instructions both provide data
store addresses and dictate which of its various processing functions an
individual processor employs. Each processor performs a
read-modify-write cycle in which data in a memory module is written back
out to the same address from which it was obtained. As discussed above
for EP-A-0 021 404, this is not compatible with a systolic array
architecture, in which (a) there is no controlling computer, (b) data
flow paths in the array are fixed, (c) data flow is between nearest
neighbours, and (d) there are no external control instructions.

It is an object of the present invention to provide a digital processor 
suitable for simulating operation of a parallel processing array such as 
a systolic array. 

The present invention provides a digital data processor for simulating
operation of a parallel processing array, the processor including an
assembly of digital processing devices connected to data storing means,
characterised in that:-

(a) each processing device is programmed to implement a respective list
    of sets of storing means data addresses;

(b) each address set contains input data addresses and output data
    addresses which differ, and each such set corresponds to data
    input/output functions of a respective simulated array cell;

(c) each list of address sets corresponds to a respective sub-array of
    cells of the simulated array, and each such list contains pairs of
    successive address sets in which the leading address sets have
    input data addresses like to output data addresses of respective
    successive address sets, each list being arranged to provide for
    operations associated with simulated cells to be executed in
    reverse order to that corresponding to data flow through the
    simulated array; and

(d) each processing device is programmed to employ a respective first
    address set to read input data from and write output data to the
    data storing means, the output data being generated in accordance
    with a computational function, to employ subsequent address sets in
    a like manner until the list is complete, and then to repeat this
    procedure cyclically.
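
Features (a) to (d) can be pictured with a minimal Python sketch. It is an
illustration under stated assumptions, not the patented implementation: the
names `AddressSet`, `run_list`, `toy_cell` and `demo`, the string addresses and
the toy computation are all hypothetical. The list holds the downstream cell
first, so that it reads the value its upstream neighbour wrote on the previous
pass, which is the reverse-order arrangement of feature (c).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class AddressSet:
    """Addresses used by one simulated cell (hypothetical layout)."""
    reads: Tuple[str, ...]   # input data addresses
    writes: Tuple[str, ...]  # output data addresses, distinct from the reads


CellFn = Callable[..., Tuple[Sequence[float], float]]


def run_list(store: Dict[str, float], address_list: List[AddressSet],
             coeffs: List[float], cell_fn: CellFn) -> None:
    """Execute one pass through the list, i.e. one simulated array cycle."""
    for k, aset in enumerate(address_list):
        inputs = [store[a] for a in aset.reads]           # read phase
        outputs, coeffs[k] = cell_fn(coeffs[k], *inputs)  # compute and update
        for addr, value in zip(aset.writes, outputs):     # write phase
            store[addr] = value


def toy_cell(coeff: float, x: float) -> Tuple[Sequence[float], float]:
    """Toy function: pass on x + 1 and accumulate x in the coefficient."""
    return (x + 1.0,), coeff + x


def demo() -> Tuple[Dict[str, float], List[float]]:
    # Data flows in -> cell A -> link -> cell B -> out, but the list runs B
    # first so that B sees the value A wrote on the previous cycle.
    address_list = [
        AddressSet(reads=("link",), writes=("out",)),  # downstream cell B
        AddressSet(reads=("in",), writes=("link",)),   # upstream cell A
    ]
    store = {"in": 0.0, "link": 0.0, "out": 0.0}
    coeffs = [0.0, 0.0]
    for x in (10.0, 20.0):  # two simulated array cycles
        store["in"] = x
        run_list(store, address_list, coeffs, toy_cell)
    return store, coeffs
```

Running the downstream cell first reproduces the one-cycle latency between
neighbouring cells of a clocked array without needing a second copy of the
data store.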

The invention provides the advantage that it requires a reduced number
of processing devices compared to a prior art array (such as a systolic
array) which it simulates. The reduction is in proportion to the number
of address sets per list. Each processing device is in effect allocated
the functions of a number or sub-array of simulated array cells, and is
programmed to execute the functions of a number of these cells in
succession and then repeat. The simulated array operation is therefore
carried out, albeit at a reduced rate. However, a degree of parallelism
is preserved because the overall computation is distributed over an
assembly of individual processing devices. In consequence, the
parallelism advantage over a conventional computer is retained. The
invention might be referred to as a semi-parallel processor.

The invention may be arranged so that each processing device
communicates with not more than four other processing devices; it may
then incorporate storing means including register devices and memories
connected between respective pairs of processing devices. The
invention may incorporate storing means arranged to resolve addressing
conflicts; preferably however the address lists are arranged such that
each register device and memory is addressed by not more than one
processing device at a time. Some of the processing devices may be
arranged to communicate with two of the other processing devices via
respective register devices. In this case the address set lists are
arranged such that the register devices are addressed less frequently
than the memories.

Each processing device may be arranged to store and update a respective
coefficient in relation to each address set in its list.

The invention may incorporate processing devices with input means arranged for
parallel to serial conversion of input data elements. This enables the processor
to implement simultaneous input as in the systolic array which it simulates.

In order that the invention might be more fully understood, embodiments thereof
will now be described, by way of example only, with reference to the
accompanying drawings, in which:-

Figures 1, 2 and 3 illustrate the construction and mode of operation of a
prior art systolic array;

Figure 4 is a block diagram of a processor of the invention arranged to
simulate part of the Figure 1 array and incorporating eight processing units;

Figure 5 illustrates the mode of operation of the Figure 4 processor
mapped on to the Figure 1 array;

Figure 6 illustrates read and write functions of a processing unit
incorporated in the Figure 4 processor;

Figure 7 illustrates memory and programming arrangements associated with
individual processing units in the Figure 4 processor;

Figure 8 schematically illustrates memory addressing in the Figure 4
processor;

Figure 9 is a block diagram of an input/output port for a processing unit;

Figures 10 and 11 illustrate the construction and mode of operation of an
alternative embodiment of the invention incorporating an odd number of
processing units; and

Figure 12 illustrates the mode of operation of a further embodiment of the
invention incorporating four processing devices.

Referring to Figure 1, a prior art triangular systolic array 10 is shown
schematically. The array 10 is of the kind disclosed in British Patent No.
2,151,378B (US Pat. No. 4,727,503). It includes a 15 x 15 above-diagonal
sub-array of internal cells indicated by squares 12. A linear chain of fifteen
boundary cells 14 shown as circles forms the triangular array diagonal. Adjacent
boundary cells 14 are interconnected via one-cycle delay cells or latches
indicated by dots 16. A multiplier cell 18 is connected to the lowermost
internal and boundary cells 12 and 14. Each of the cells 12 to 18 is activated
by a system clock (not shown), and the cells 12 to 16 carry out pre-arranged
computations on each clock cycle. Input to the array 10 is from above as
indicated by arrows 20. Horizontal outputs from boundary cells 14 pass along
array rows as indicated by intercell arrows 22. Outputs from internal cells 12
pass down array columns as indicated by vertical intercell arrows 24. Boundary
cells 14 have diagonal inputs and outputs such as 26 and 28 interconnected
along the array diagonal via latches 16.

Referring now also to Figure 2, the processing functions of the internal and
boundary cells 12 and 14 are shown in greater detail. On each clock cycle,
each boundary cell 14 receives an input value $x_{in}$ from above. It employs a
stored coefficient r together with $x_{in}$ to compute cosine and sine rotation
parameters c and s and an updated value of r in accordance with:

    $r' = \sqrt{r^2 + x_{in}^2}$                                  (1)

For $x_{in} = 0$, c = 1 and s = 0; otherwise

    $c = r/r', \quad s = x_{in}/r'$                               (2)

and

    $r\ (\text{updated}) = r'$                                    (3)

The parameters c and s are output horizontally to a neighbouring internal cell 12 
to the right. 

Each boundary cell 14 also multiplies an upper left diagonal input
$\delta_{in}$ by the parameter c to provide a lower right diagonal output
$\delta_{out}$, ie

    $\delta_{out} = c\,\delta_{in}$                               (4)

This provides for cumulative multiplication of c parameters along the array 
diagonal. 
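
As a hedged illustration, the boundary cell function can be sketched in Python.
The sketch assumes equations (1) to (4) take the usual Givens rotation form,
$r' = \sqrt{r^2 + x_{in}^2}$, $c = r/r'$, $s = x_{in}/r'$ and
$\delta_{out} = c\,\delta_{in}$, and that the stored r is non-negative; the
function name `boundary_cell` is hypothetical.

```python
import math
from typing import Tuple


def boundary_cell(r: float, x_in: float,
                  delta_in: float) -> Tuple[float, float, float, float]:
    """Return (c, s, updated r, delta_out) for one clock cycle."""
    if x_in == 0.0:
        # Special case: identity rotation, stored r unchanged (r >= 0 assumed).
        c, s, r_new = 1.0, 0.0, r
    else:
        r_new = math.hypot(r, x_in)      # equation (1)
        c, s = r / r_new, x_in / r_new   # equation (2)
    delta_out = c * delta_in             # equation (4)
    return c, s, r_new, delta_out        # r_new replaces r, equation (3)
```

For example, a stored r of 3 and an input of 4 give an updated r of 5 with
c = 0.6 and s = 0.8.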



On each clock cycle, each internal cell 12 receives input of c and s parameters
from the left and $x_{in}$ from above. It computes $x_{out}$ and updates its
stored coefficient r in accordance with:-

    $x_{out} = -s\,r + c\,x_{in}$                                 (5)

    $r\ (\text{updated}) = c\,r + s\,x_{in}$                      (6)
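
The internal cell update of equations (5) and (6) is a plane rotation applied
to the pair (r, $x_{in}$). A minimal Python sketch follows; the function name
`internal_cell` is hypothetical.

```python
from typing import Tuple


def internal_cell(r: float, c: float, s: float,
                  x_in: float) -> Tuple[float, float]:
    """Return (x_out, updated r) for one clock cycle."""
    x_out = -s * r + c * x_in   # equation (5): passed down the column
    r_new = c * r + s * x_in    # equation (6): stored for the next cycle
    return x_out, r_new
```

Because $c^2 + s^2 = 1$ for parameters produced by a boundary cell, the
rotation preserves $r^2 + x_{in}^2$; with c = 0.6, s = 0.8, r = 3 and
$x_{in}$ = 4 the output $x_{out}$ is zero and the updated r is 5.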

Data input to the array 10 is illustrated schematically in Figure 3, in which the
vertical dimension is shown foreshortened for illustrational convenience. Figure 3
shows a first vector $x_1$ and a first element $y_1$ in the process of input to
the array 10. The vector $x_1$ has fifteen elements $x_{11}$ to $x_{1,15}$, and
is the leading row of a data matrix X. A column vector y is input to the
rightmost array column. The vector y has elements $y_1$, $y_2$, ..., and the
nth element $y_n$ appears as an extension of the nth row $x_{n1}$ to
$x_{n,15}$ of the data matrix X. As illustrated, $y_1$ extends $x_1$.

The first element $x_{11}$ of the first input vector $x_1$ is input to the top
row (leftmost) boundary cell 14. Successive elements $x_{12}$, $x_{13}$ etc of
$x_1$ are input to successive top row internal cells 12 with a temporal skew.
Temporal skews are well known in the art of systolic arrays. In the present
case the skew is a delay of one clock cycle between input to adjacent top row
cells of elements of like vectors. The skew increases linearly to the right, so
that input of the ith element $x_{ni}$ of the nth vector $x_n$ to the ith
column of the array 10 lags input of $x_{n1}$ to the first column by (i - 1)
clock cycles.
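
The skew can be made concrete with a short Python sketch. It assumes only what
the paragraph states, a delay of one clock cycle per column, so that column i
lags column 1 by (i - 1) cycles; the function names and the `first_cycle`
parameter are hypothetical.

```python
from typing import List


def column_lag(i: int) -> int:
    """Clock cycles by which input to column i lags input to column 1."""
    if i < 1:
        raise ValueError("columns are numbered from 1")
    return i - 1


def arrival_cycles(n_columns: int, first_cycle: int) -> List[int]:
    """Cycle on which each element of one vector enters its column,
    given the cycle on which the first element enters column 1."""
    return [first_cycle + column_lag(i) for i in range(1, n_columns + 1)]
```

For the sixteen columns of the array 10 (fifteen x columns plus the y column),
the last column receives its element fifteen cycles after the first.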

When $x_{11}$ is input to the uppermost boundary cell 14, it is employed to
compute rotation parameters c, s for transforming the first vector $x_1$ into a
rotated vector having a leading element of zero. On the clock cycle following
input of $x_{11}$ to the uppermost boundary cell 14, $x_{12}$ is input to its
row neighbour internal cell 12 in synchronism with input of c, s computed from
$x_{11}$. One clock cycle later, the parameters c, s derived from $x_{11}$
reach the third cell from the left in the top row and are used to operate on
$x_{13}$. In this manner, c, s computed from $x_{11}$ are employed to operate
on elements $x_{12}$ to $x_{1,15}$ and $y_1$ on successive clock cycles. This
produces a rotated version of $x_1$ from which $x_{11}$ is eliminated, the
version passing to the second processor row. A similar procedure occurs in the
second row, ie the rotated version of $x_{12}$ is used to compute c and s
values for operation on the rotated versions of $x_{13}$ to $x_{1,15}$ and
$y_1$. This procedure continues down the processor rows until all x-vector
elements have been eliminated.

Subsequent data vectors $x_2$, $x_3$ etc representing further rows of the data
matrix X are processed in the same way as $x_1$ by input to the uppermost array
row. In general, the ith element $x_{ni}$ of the nth data vector $x_n$ is input
to the ith array column on the (n + i + 1)th clock cycle. Similarly, the nth
element $y_n$ of the column vector y is rotated in each row as though it were
an additional element of the nth data vector $x_n$. Each cumulatively rotated
version of $y_n$ passes to the multiplier cell 18. Here it is multiplied by the
cumulatively multiplied c rotation parameters derived from $x_n$ and computed
along the array boundary cell diagonal. The output of the multiplier cell 18 is
the least squares residual $e_n$ given by:-

    $e_n = x_n^T\,w(n) + y_n$                                     (7)

where: $x_n^T$ is the transpose of $x_n$, and

$w(n)$ is a weight vector computed over all $x_1$ to $x_n$ to minimise the
sum of the squares of $e_1$ to $e_n$.

In more general mathematical terms, the array 10 carries out a QR
decomposition of the data matrix X as described in the prior art; ie the rotation
algorithm operates on X to generate a matrix Q such that:-

    $Q X = \begin{bmatrix} R \\ 0 \end{bmatrix}$                  (8)

where R is an upper right triangular matrix. The matrix elements r of R are
stored on individual internal and boundary cells 12 and 14 in all but the
rightmost array column, and are recomputed every clock cycle. At the end of
computation, the elements r may be extracted from their storage locations and
used to compute the weight vector explicitly.
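
The net arithmetic effect of the array on the x columns can be sketched
serially in pure Python. This is an illustration only: it ignores the systolic
timing, the y column and the multiplier cell, and assumes the rotation
equations take the Givens form $r' = \sqrt{r^2 + x^2}$, $c = r/r'$,
$s = x/r'$. The function names `qr_update` and `triangularise` are
hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def qr_update(R: Matrix, row: Sequence[float]) -> None:
    """Rotate one data row into the upper triangular matrix R, in place."""
    x = list(row)                    # work on a copy of the input row
    n = len(x)
    for i in range(n):               # array row i: one boundary cell ...
        if x[i] == 0.0:
            continue                 # c = 1, s = 0: nothing changes
        r_new = math.hypot(R[i][i], x[i])
        c, s = R[i][i] / r_new, x[i] / r_new
        R[i][i] = r_new
        for j in range(i + 1, n):    # ... followed by its internal cells
            r_ij = R[i][j]
            R[i][j] = c * r_ij + s * x[j]
            x[j] = -s * r_ij + c * x[j]


def triangularise(X: Sequence[Sequence[float]]) -> Matrix:
    """Feed the rows of X through the rotations, starting from R = 0."""
    n = len(X[0])
    R = [[0.0] * n for _ in range(n)]
    for row in X:
        qr_update(R, row)
    return R
```

Since Q is orthogonal, the result satisfies $R^T R = X^T X$, which gives a
simple check: for X with rows (3, 1) and (4, 2) the sketch yields R with rows
(5, 2.2) and (0, 0.4).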

Figures 1 to 3 exemplify a typical prior art systolic array arranged inter alia to
carry out QR decomposition. The array 10 exhibits the following characteristics
which typify systolic arrays:

(a) nearest-neighbour cell interconnections form rows and columns;

(b) many of the cells (ie internal cells) have like signal processing functions;

(c) each cell performs its processing function on each system clock cycle;
    and

(d) signal flow is generally down columns and along rows of the array.

Systolic arrays suffer from the major disadvantage of requiring large numbers of
processing cells, such as internal cells 12 in particular. To perform a QR
decomposition on the data matrix X and associated residual extraction involving
the vector y, the array 10 employs a linear chain of fifteen boundary cells 14
and a triangular sub-array of one hundred and twenty internal cells 12. The
internal cells 12 form a 15 x 15 sub-array, and the array 10 as a whole is a 16
x 16 array. This arises from the fifteen-dimensional nature of the data matrix
X and the one-dimensional nature of each element of the vector y. Generally,
the number of cells required in a systolic array grows as the square of the
number of dimensions of the computation to be performed. In a version of the
array 10 appropriate for an n-dimensional data matrix X, n(n + 1)/2 internal
cells 12 would be required. Each cell is of the order of complexity of a
microprocessor having floating point arithmetic capability, and requires the
ability of a transputer to communicate with up to four neighbours. For
computations where n is in the order of 100 or greater, the number of cells is
of order $10^4$ or more. The cost and bulk of such an array is therefore
unacceptably large for many purposes.
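
The cell counts quoted above follow directly from the n(n + 1)/2 formula, as
the following Python sketch shows. The function names are hypothetical, and
`processing_units` assumes, for illustration, that the internal cells are
shared equally between processing devices with a fixed number of address sets
per list.

```python
def internal_cells(n: int) -> int:
    """Internal cells for an n-dimensional data matrix plus the y column."""
    return n * (n + 1) // 2


def boundary_cells(n: int) -> int:
    """Boundary cells grow only linearly with n."""
    return n


def processing_units(n: int, sets_per_list: int) -> int:
    """Devices needed if each one simulates `sets_per_list` internal cells
    (rounded up); an assumed equal split, not a rule stated in the text."""
    return -(-internal_cells(n) // sets_per_list)
```

With n = 15 the formula gives 120 internal cells and 15 boundary cells, and
n = 100 gives 5050 internal cells, in line with the order of magnitude quoted.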

Referring now to Figure 4, there is shown a processor 40 of the invention. The
processor 40 incorporates eight processing units P1 to P8 with respective
associated two-port memories M1 to M8. The unit P1 is also associated with a
two-port memory M0. The units P1 to P8 are connected to respective decoders
D1 to D8 and input/output ports I/O1 to I/O8. The input/output ports I/O1 to
I/O8 are shown in simplified form to reduce illustrational complexity, but will be
described in more detail later. Each is arranged to accept up to four digital
words simultaneously in parallel, and to transfer them serially to a corresponding
processing unit P1 to P8. They also provide for serial word output.

The ith processing unit $P_i$ (i = 1 to 8) is associated with a respective data
bus $B_i$ and memory address bus $A_i$. The ith address bus $A_i$ connects
processing unit $P_i$ to memories $M_i$ and $M_{i-1}$. Each of the input/output
ports I/O1 to I/O8 has complex read/write and data input/output connections
(not shown) to external circuitry. These will be illustrated in detail later.
In Figure 4, they are indicated schematically by respective buses $41_1$ to
$41_8$. The ith data bus $B_i$ connects processing unit $P_i$ to memories
$M_i$ and $M_{i-1}$, to port I/O$_i$ and to a block of word registers indicated
generally by 42. The register block 42 incorporates three sections $42_1$ to
$42_3$ each of four registers R11 to R34, the ith section $42_i$ (i = 1 to 3)
consisting of registers $R_{i1}$ to $R_{i4}$. The block 42 also includes a
fourth section $42_4$ consisting of one register R41. Each register $R_{ij}$ is
shown with a single or double arrow indicating its input side (left or right)
and the number of digital words stored; ie single and double arrows correspond
to one and two stored words respectively. Each register is a first in, first
out (FIFO) device. Registers R11, R21, R31 and R41 are one word devices
receiving input from the left and providing output to the right. Register
inputs and outputs are unreferenced to reduce illustrational complexity.
Registers R12, R22 and R32 are also one word devices, but input from the right
and output to the left. Registers R13, R14, R23, R24, R33 and R34 are two word
devices which input from the left and output to the right.



The ith section of registers $42_i$ (i = 1 to 4) is connected to data bus
$B_{9-i}$ to its left, each register having a respective bus branch connection.
The upper three registers (eg R32 to R34) of the ith section $42_i$ (i = 1 to
3) are connected to data bus $B_{i+1}$ (eg B4) to their right. However, the
lowermost register $R_{i1}$ in the ith section $42_i$ (i = 1 to 4) is connected
to data bus $B_i$.



The processing units P1 to P8 have respective read-write output lines R/W1 to
R/W8 connected to ports I/O1 to I/O8, associated memories M0-M1 to M7-M8
and registers R11, R12 to R21 etc. The lines R/W1 etc are each two bits wide
as indicated by /2. The units P1 to P8 are also connected to their respective
decoders D1 to D8 by three-bit chip address lines C1 to C8 marked /3.

Each of the decoders D1 to D8 has seven one-bit output lines such as D2 lines
44 for example, and these lines are connected to respective memory, I/O port
and register devices M1, I/O1, R11 etc. Some decoder lines such as those at 46
of D5 are surplus to requirements. These are left unconnected as indicated by
X symbols. X symbols also indicate unconnected buses below memories M0 and
M8.



The mode of operation of the processor 40 as compared to that of the prior art
device 10 is illustrated in Figure 5. In this drawing, conceptual locations of
internal cells 12 in the device 10 are indicated as rectangles such as 50. The
scale of the drawing is vertically foreshortened for illustrational convenience.
Each of the processing units P1 to P8 executes the computational tasks of a
respective fifteen internal cells 12. In accordance with this, each rectangle 50
incorporates within it a number indicating the associated processing unit; ie
rectangles 50 (and the internal cells 12 they represent) referenced internally with
the numeral i (i = 1, 2, ... 7 or 8) are associated with processing unit $P_i$.
Each rectangle also has external upper left and lower right indices V1 and V2
respectively, where V1 is in the range 1 to 15 and V2 = V1 + 15 in each case.
V1 and V2 respectively correspond to the first and second intervals in time at
which the relevant processor in each case carries out the function of the internal
cell associated with the location. The drawing also includes diagonal lines
representing memories M0 to M8. Dotted lines 52 link different regions
associated with respective common memories. Locations representing register
sections are indicated by multi-cornered lines with like references $42_1$ to
$42_4$.

In operation, each processing unit $P_i$ executes in sequence processing tasks
which would be carried out by a respective fifteen internal cells 12 in the
prior art. A cycle of operation of the prior art systolic array 10 therefore
requires fifteen cycles of the processor 40 of the invention. The latter will
be referred to as subcycles. Subcycles 1 to 15 consequently correspond to
cycle 1, subcycles 16 to 30 to cycle 2 and so on. Numerals V1 and V2 in Figure
5 are subcycle numbers. On subcycles or V1 values 1 to 15, processing unit P1
executes the processing functions of internal cells located in the lower
sections of the two lowest diagonals of the array 10, as indicated by numeral 1
within corresponding rectangles 50 in Figure 5. Unit P1 begins on subcycle 1
with a computation corresponding to the function of that internal cell 12 in
the centre of the lowest diagonal of the array of cells, as indicated by an
upper left V1 value of 1. On subcycle 2, as indicated by V1 = 2, unit P1
carries out a first cycle computation corresponding to the lowermost internal
cell 12 in the final (rightmost) column. On subcycle 3, the computation is that
of the internal cell 12 in the penultimate (second lowest) row and final
column. This procedure is repeated on successive subcycles, the conceptual
location of the processing operation reducing by one in row or column position
alternately. After subcycle 15, ie after the end of cycle 1, the computation
executed is that of the lowest internal cell at the centre of the lowest
diagonal once more, as indicated by V2 = 16, and thereafter the sequence
repeats to implement cycle 2.
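
The relationship between subcycles, simulated array cycles and positions in a
fifteen-entry list can be written down as a short Python sketch; the function
names are hypothetical.

```python
SUBCYCLES_PER_CYCLE = 15  # one address set per subcycle, fifteen per list


def cycle_of(subcycle: int) -> int:
    """Simulated array cycle to which a processor subcycle belongs."""
    return (subcycle - 1) // SUBCYCLES_PER_CYCLE + 1


def list_position(subcycle: int) -> int:
    """Position (1 to 15) in the address set list used on a subcycle."""
    return (subcycle - 1) % SUBCYCLES_PER_CYCLE + 1
```

Subcycles 1 to 15 map to cycle 1 and subcycles 16 to 30 to cycle 2, and a
location labelled V1 is revisited on subcycle V2 = V1 + 15 at the same list
position.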

Similar processing sequences are executed by processing units P2 to P8. Units
P2, P3 and P4 carry out computations corresponding to respective pairs of part
minor diagonals. The equivalents for units P5, P6 and P7 are two respective
complete minor diagonals together with two respective part minor diagonals. For
unit P8, there is a single upper right location and the upper parts of diagonals
having lower parts associated with P1.

Each of the units P1 to P8 reads from and writes to respective memories among
M0 to M8 and register sections $42_1$ to $42_4$. Memories and registers are
illustrated in Figure 5 adjacent the conceptual internal cell locations to which
they are interfaced. For example, throughout each cycle processing unit P1
communicates with memories M0 and M1, but also communicates with register
section $42_1$ on subcycle 1 (ie one subcycle per cycle). Unit P2 communicates
with register section $42_1$ on subcycle 1 and both register sections $42_1$
and $42_2$ on subcycle 3.



The mode of operation of the processor 40 of the invention will now be
described in more detail with reference to Table 1 and Figures 6 to 8. Parts in
Figures 6 to 8 and Table 1 which were described earlier are like referenced.
Figure 2 illustrated each internal cell 12 receiving input of three quantities c, s
and $x_{in}$, performing a computation and generating outputs c, s and
$x_{out}$. This is re-expressed in Figure 6 as three read operations RE1 to RE3
and three write operations WR1 to WR3. In Figure 7, the nth processing unit
$P_n$ (n = 1, 2, ... 7 or 8) is shown connected between memories $M_{n-1}$ and
$M_n$. It incorporates processing logic responsive to a stored program in local
(ie internal) memory which also contains a data address look-up table and a
coefficient store. The look-up table is a list of fifteen address sets, ie one
set per subcycle. The coefficient store has space for fifteen updatable
coefficients of the kind r, and for temporary storage of a value for which an
output delay is required. In Figure 8, the lower right hand region of Figure 5
is shown on an expanded scale. Memories M0 to M3 are shown subdivided into
individual address locations labelled with integers. Not all address locations
illustrated are employed. As in Figure 5, in Figure 8 each processing unit P1
to P3 (indicated by the relevant numerals within boxes) has an upper left
numeral indicating subcycle number. In Table 1, addresses in memories M0 to M3
are given for read and write operations in processing units P1 to P3 on
subcycles 6 and 7. Addresses shown in Figure 8 and Table 1 are in the range 0
to 22 for illustrational convenience, although in practice a typical memory
address space would be 256 (8 bits) or greater.
As has been said, processing begins on subcycle 1, the first subcycle of the first 
cycle. However, it turns out that the first subcycle of each cycle is in fact a
special case. In consequence, read/write operations on subcycles 2 onwards will 
first be described as being typical, and those of subcycle 1 will be discussed 
later. 

Processing unit P₁ operates as follows. Referring to Figures 4, 7 and 8 once
more, for each subcycle the stored programme in local memory has three
successive read instructions; each of these requires data to be read from three
data addresses of a respective address set stored in the local memory look-up
table and corresponding to the current subcycle of operation. The look-up table
also stores values for the chip address lines C₁, which are equivalent to a
three-bit extension of the address bus A₁. In Table 1, MₙZ designates address
Z in memory Mₙ. On subcycle 2, the read operations RE1, RE2 and RE3 for
processing unit P₁ are from addresses M₁0, M₀8 and M₀7 respectively. The unit
P₁ places an address on address bus A₁ corresponding to Z = 0, and places a
three-bit code on chip address lines C₁ providing for M₁ to be enabled by
decoder D₁ and for M₀, R₁₁ and I/O₁ to be disabled. It also places a two-bit
"read" code on read/write line pair R/W₁ to signify a read operation. This
causes memory M₁ to place the contents of its address 0 on the data bus B₁,
where it is read by processing unit P₁ as RE1 and temporarily stored. Unit P₁
then changes the code output at C₁ to that required for decoder D₁ to enable
M₀, and changes the address on bus A₁ to Z = 8 and subsequently Z = 7.
This provides for successive read operations RE2 and RE3 from addresses 8 and
7 of M₀.

Having carried out three read operations in succession on subcycle 2, unit P₁
executes the internal cell computations shown in Figure 2 to generate x_out and r
(updated) for the second conceptual internal cell with which it is associated. It
replaces its internally stored value of the second of fifteen coefficients r (initially
zero) by r (updated), and is then ready to output the newly computed x_out
together with two unchanged input values (c, s input as RE2, RE3). On
subcycle 2, unit P₁ performs the function of the lowermost internal cell 12 of
Figure 1, which provides c, s and x_out signals to destinations outside the internal
cell sub-array. In the processor 40, this situation is implemented by write
operations to an input/output port. The processing unit P₁ consequently executes
three successive write operations to port I/O₁. It obtains from its look-up table
the next three chip address codes. These are in fact the same code, that
required to access port I/O₁ and for which no address on bus A₁ is needed.
They form the second half of the first address set. Unit P₁ places on chip
address lines C₁ the chip address code obtained from the look-up table. This
activates decoder D₁ to enable port I/O₁, and unit P₁ subsequently places a
two-bit "write" code on line pair R/W₁ and places values x_out, c and s in
succession on data bus B₁ as WR1, WR2 and WR3 respectively. This routes the
values to subsequent signal processing circuitry (not shown) interfaced to port
I/O₁.

Subcycle 2 ends when WR3 has been output, and processing unit P₁ proceeds to
implement the subcycle 3 functions. These require reading from M₁5, M₀10 and
M₀9, and writing to M₁0 (WR1) and I/O₁ (WR2 and WR3), which form the
third address set of unit P₁. The WR1 function overwrites the contents of M₁0
read on the preceding subcycle. Unit P₁ also computes and internally stores an
updated R-matrix element r appropriate to its third associated internal cell
location (V1 = 3). On later subcycles, as shown in Figure 8, the read and write
operations are to and from memory addresses in M₀ and M₁. Table 1 gives the
read and write memory addresses in M₀ to M₃ and port I/O₃ for processing
units P₁ to P₃ on subcycles 6 and 7.
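
The read, compute and write sequence described above for unit P₁ on subcycles 2 and 3 may be sketched in Python as follows. The address sets are those quoted in the text. Everything else is an assumption made for illustration: the function and variable names, the dictionary layout, and in particular the rotation formula, which is the conventional Givens form and is not taken from Figure 2.

```python
# Illustrative sketch only: names and data layout are assumptions.

def internal_cell(x_in, c, s, r):
    """Assumed Givens-style internal cell: returns (x_out, updated r)."""
    x_out = c * x_in - s * r
    r_new = s * x_in + c * r
    return x_out, r_new

# Address sets for unit P1 as quoted in the text: reads RE1..RE3, writes WR1..WR3.
# ("IO1", None) stands for port I/O1, which needs no address on bus A1.
ADDRESS_SETS = {
    2: {"reads": [("M1", 0), ("M0", 8), ("M0", 7)],
        "writes": [("IO1", None), ("IO1", None), ("IO1", None)]},
    3: {"reads": [("M1", 5), ("M0", 10), ("M0", 9)],
        "writes": [("M1", 0), ("IO1", None), ("IO1", None)]},
}

def run_subcycle(subcycle, memories, r_store, port):
    """Perform three reads, one cell computation and three writes."""
    address_set = ADDRESS_SETS[subcycle]
    x_in, c, s = [memories[dev][addr] for dev, addr in address_set["reads"]]
    r_old = r_store.get(subcycle, 0.0)   # coefficients r are initially zero
    x_out, r_store[subcycle] = internal_cell(x_in, c, s, r_old)
    for (dev, addr), value in zip(address_set["writes"], (x_out, c, s)):
        if dev == "IO1":
            port.append(value)           # routed to external circuitry
        else:
            memories[dev][addr] = value  # overwrites the earlier contents
    return x_out

# Arbitrary test data, not values from the specification.
memories = {"M0": {7: 0.8, 8: 0.6, 9: 0.0, 10: 1.0}, "M1": {0: 1.0, 5: 2.0}}
r_store, port = {}, []
run_subcycle(2, memories, r_store, port)
run_subcycle(3, memories, r_store, port)
```

After the two calls, address 0 of M1 holds the x_out value computed on subcycle 3, illustrating how WR1 overwrites a location read on the preceding subcycle.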

Processing unit P₁ reads from memories M₀/M₁ and writes to those memories
and/or port I/O₁ exclusively during subcycles 2 to 15. On subcycle 1 however,
as indicated in Figure 5, it is interfaced with register section 42₁ immediately
above. As shown in Figure 6, RE1 is read from above. It is therefore received
from register R₁₁ of register section 42₁ in response to an enable signal from
decoder D₁. Register R₁₁ receives input from the eighth processing unit P₈ on
later cycles.


Subcycle 1 is a special case in the operation of processing unit P₁. So also are
subcycles 16, 31 etc, ie the first subcycle of each cycle and numbered
(15(n-1) + 1), n = 1, 2, 3 etc. These are also special cases for the other
processing units P₂ to P₈. The reason is as follows. In the simulated systolic
array 10, data and result flow is downwards and to the right. It progresses at
the rate of one cell per clock cycle along rows and down columns. An internal
cell having a neighbour to its left or above receives data from the neighbour
which the neighbour used or computed one cycle earlier. In the processor 40
however, as shown in Figure 5, a processing unit (P₁ etc) proceeds conceptually
upwards and to the left on successive subcycles in the reverse of the systolic
array data flow direction. In consequence of this, inputs to a processing unit P₁
etc from a neighbouring location are not generated one cycle earlier, but instead
one cycle minus one subcycle earlier. For most of each cycle this difference is
immaterial. However, on subcycle 1 (and later equivalents) the right hand
neighbouring location corresponds to subcycle 15; ie these two subcycles are the
beginning and end of the same first cycle. The right hand location (V1 = 15,
V2 = 30) is fourteen subcycles behind the left hand location (V1 = 1, V2 = 16)
in this special case, instead of being one subcycle ahead as elsewhere in the
cycle. In consequence, in the absence of arrangements to the contrary, right
hand outputs (values c, s output as WR2, WR3) from processing unit P₁ on
subcycle 1 of the first cycle would be used as inputs on subcycle 15 of the first
cycle. Similarly, the vertical output (x_out = WR1) from processing unit P₁ on
subcycle 1 to memory M₀ would occur too early. This would conflict with the
systolic array processing requirement that a result generated by a processing cell
on one cycle is to be employed by a neighbour to its right or below on the
succeeding cycle. Similar remarks apply to all other processing units P₂ to P₈.

To deal with this timing problem, on the first subcycle of each cycle, ie subcycle
(15(n-1) + 1), n = 1, 2, 3 etc, the processing units P₁ to P₈ store internally
their current values of x_out, c and s. They each output as WR1, WR2 and
WR3 the respective values of x_out, c and s which they stored on the preceding
cycle (if any). In consequence, on (and only on) the first subcycle of each
cycle, outputs from the processing units P₁ to P₈ are delayed by one cycle.
This involves an additional three storage locations in each processing unit's
internal coefficient store.
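
The one-cycle output delay applied on the first subcycle of each cycle may be illustrated by the following minimal Python sketch. The class DelayedOutput and the function names are hypothetical and serve only to show the behaviour described; the specification does not prescribe how the three extra storage locations are managed.

```python
SUBCYCLES_PER_CYCLE = 15  # as in the processor 40

def first_subcycle(n):
    """Global number of the first subcycle of cycle n: 15(n-1) + 1."""
    return SUBCYCLES_PER_CYCLE * (n - 1) + 1

def is_first_subcycle(subcycle):
    """True for subcycles 1, 16, 31 etc."""
    return (subcycle - 1) % SUBCYCLES_PER_CYCLE == 0

class DelayedOutput:
    """Hypothetical helper holding one (x_out, c, s) triple for one cycle."""

    def __init__(self):
        self.held = None  # nothing has been stored before cycle 1

    def emit(self, subcycle, values):
        """Return the triple actually written as WR1, WR2, WR3."""
        if not is_first_subcycle(subcycle):
            return values                     # output without delay
        previous, self.held = self.held, values
        return previous                       # triple stored one cycle earlier
```

On subcycle 1 of the first cycle emit returns None, corresponding to the "(if any)" qualification in the text; on subcycle 16 it returns the triple stored on subcycle 1.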

At the end of subcycle 15, the first cycle of operations of unit P₁ is complete
and subcycle 16 begins the second cycle (V2 = 16 to 30). As shown in Figure
5, processing unit P₁ reverts to execution of the computation of the internal cell
12 at the centre of the lowermost diagonal. On cycle 2 (subcycles 16 to 30) the
unit P₁ reads in data (if any) stored in memories M₀ and M₁ and register R₁₁
during cycle 1. It temporarily stores three data values to implement a one cycle
delay. It also stores fifteen values of r in the process of updating, each r
corresponding to a respective prior art internal cell.

Similar remarks apply to other processing units P₂ to P₈ and to later cycles. In
general, a processing unit Pₙ reads from and writes to its associated memories
Mₙ₋₁ and Mₙ (n = 1 to 8) for most of a cycle. Exceptions to this are as
follows. Units P₅ to P₈ execute RE1 (x_in) from respective ports I/O₅ to I/O₈
when performing computations corresponding to internal cells 12 in the uppermost
row of the Figure 1 prior art array 10. Units P₅, P₆ and P₇ are in this
situation on four subcycles of each cycle (eg unit P₅ on subcycles 6 to 9 of
cycle 1), but unit P₈ only three. This "uppermost" RE1 operation is equivalent
to input of an element of a data matrix X (see Figure 3) to an array 10. All
eight processing units P₁ to P₈ execute processing functions corresponding to
internal cells in extreme right hand column locations at respective points in each
cycle. Units P₁ to P₇ are in this situation for two subcycles per cycle, whereas
the equivalent for unit P₈ is only one subcycle. When in this situation, the
units P₁ to P₈ execute WR2 and WR3 to respective ports I/O₁ to I/O₈; P₁
executes WR1 to I/O₁ also for one of its two subcycles in this situation as
previously described. This extreme right hand output function corresponds to
output from the prior art internal cell sub-array of Figure 1. Finally, a
processing unit Pᵢ (i = 1 to 8) reads from or writes to units P₁₀₋ᵢ and/or P₉₋ᵢ
via the intervening register block 42. Each register such as R₁₂ or R₁₃ is a
one or two word temporary storage device arranged on a first in, first out
(FIFO) basis, as has been said. The registers R₁₁ to R₄₁ provide for
communication between processing units which either do not share a common
memory or require additional storage to avoid simultaneous memory addressing by
two units. For example, as shown in Figure 5, on each of subcycles 3 and 5
processing unit P₃ performs two read operations RE2 and RE3 from unit P₇ via
registers R₂₃ and R₂₄ of register block 42₂. On subcycle 3, unit P₃ reads the
first stored words in registers R₂₃ and R₂₄, and on subcycle 5 it reads the
succeeding words stored therein. It also reads the contents of R₃₁ as RE1 and
writes to R₂₂ as WR1 on subcycle 5. Other read/write operations are to and
from memories M₂ and M₃. Similar remarks apply to other pairs of processing
units interfaced together via the register block 42.






The processing units P₁ to P₈ operate in synchronism under the control of an
external clock (not shown). This is similar to prior art systolic arrays and will
not be described. As illustrated and described with reference to Figures 5, 6
and 8, the phasing of the read and write operations of the processing units P₁
to P₈ is arranged to ensure that each of the memories M₀ to M₈ is required to
respond only to a single address input at any time. For example, on subcycle 5
in Figure 8, units P₁ and P₂ carry out read-write operations to memories M₀/M₁
and M₁/M₂ respectively, which could cause a clash in access to M₁. However,
unit P₁ begins (RE1) by reading from M₁ when unit P₂ is reading from M₂.
Consequently, the P₁ RE2 and RE3 operations are both from M₀, at which time
P₂ has switched to addressing M₁. This phasing of read operations avoids
memory address conflict. Similar remarks apply to write operations and
processing units P₃ to P₈ and memories M₃ to M₈. A read operation is at
the beginning of any subcycle and a write operation is at the end. A memory
(eg M₁) may consequently experience read and write operations on the same
subcycle without conflict of addresses on an address bus (eg A₂); however, in
general two simultaneous operations to a single memory must be avoided.
It is of course possible to accommodate such conflict at the expense of
duplication of address buses and memories.
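
The requirement that each memory responds to only one address input at a time can be checked mechanically. The following Python sketch tabulates the read phases of subcycle 5 for the two units mentioned above and reports any memory addressed by more than one unit in the same phase. The table layout and the function name are assumptions; the entries reflect one reading of the example in the text, in which the second unit's RE2 and RE3 operations are taken to be from the shared memory.

```python
from collections import Counter

# Memory addressed by each unit in each read phase of subcycle 5,
# following the example in the text ("P1", "M0" etc are plain labels).
SCHEDULE = {
    "RE1": {"P1": "M1", "P2": "M2"},
    "RE2": {"P1": "M0", "P2": "M1"},
    "RE3": {"P1": "M0", "P2": "M1"},
}

def conflicts(schedule):
    """Return (phase, memory) pairs addressed by more than one unit at once."""
    found = []
    for phase, accesses in schedule.items():
        counts = Counter(accesses.values())
        found.extend((phase, mem) for mem, n in counts.items() if n > 1)
    return found

# An unphased schedule in which both units start with the shared memory.
UNPHASED = {"RE1": {"P1": "M1", "P2": "M1"}}
```

Calling conflicts(SCHEDULE) gives an empty list, whereas conflicts(UNPHASED) reports the clash on the shared memory in phase RE1.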

Referring now also to Figure 9, in which parts previously described are
like-referenced, the structure of each of the input/output ports I/O₁ to I/O₈ is
shown in more detail. Subscript indices to references (eg 1 in I/O₁) are omitted
to indicate all parts of the relevant type are referred to. The port I/O
incorporates a four-word parallel-in/serial-out input register 60, together with a
one-word parallel-in/parallel-out register 62. The input register 60 has four data
input buses such as 64, and four write control inputs such as 66 connected to a
common write line 68. The output register 62 has an output bus 70 and an
associated read output line 72. The read/write line pair R/W of Figure 4
incorporates a read line 74 connected to the input register 60 and a write line
76 connected to the output register 62. The two-way data bus B is connected
to both registers 60 and 62. The connections 64 to 72 inclusive were indicated
collectively by bus 41 in Figure 4.

The port I/O operates as follows. Immediately prior to the first subcycle of
each cycle of operation of the processor 40, the write line 68 of the input
register 60 is pulsed, and four digital data words are applied simultaneously to
respective register inputs 64. This overwrites existing register contents and loads
the four words in the register 60 in a successively disposed manner. Each time
the read line 74 is pulsed, the word associated with the right hand input 64 is
placed on the data bus B, and the remaining words are shifted to the right.
This provides for the four loaded words to be output on the data bus B one
after the other in response to four successive read line pulses. Referring now
also to Figure 5 once more, it can be seen that processing unit P₆ requires to
read data from I/O₆ when it is performing top row computations on subcycles
(V2 values) 19, 20, 25 and 26. On each of these subcycles, the unit P₆ will
send out a respective read pulse on line pair R/W₆, and requires a respective
digital word to be placed on data bus B₆ by its input register 60 consisting of
the correct matrix element x_{i,j} of the data matrix X previously referred to. Unit
P₆ deals with the 5th, 6th, 11th and 12th top row cell locations. Matrix
elements of the kind x_{n,6}, x_{n-1,7}, x_{n-6,12} and x_{n-7,13} are therefore
simultaneously input to the register 60 of unit P₆. Here, n is a positive integer,
and n-k less than or equal to zero is interpreted as x_{n-k,q} equal to zero for all
q. Input is at the end of the last (fifteenth) subcycle of each cycle as has been
indicated. This ensures that the data is present to be read in over the next
cycle by different processing units executing top row computations at different
times. The processing unit P₆ reads in data words in reverse order (ie x_{n-7,13}
leading) on two successive subcycles followed by a four subcycle gap then two
further subcycles.
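
The behaviour of the four-word parallel-in/serial-out input register 60 may be modelled by the short Python sketch below. The class and method names are hypothetical, and the words are represented by placeholder strings. What happens on a fifth read pulse without an intervening load is not stated in the text; in this sketch it simply raises an error.

```python
from collections import deque

class InputRegister:
    """Four-word parallel-in/serial-out register, modelled on register 60."""

    def __init__(self):
        self.words = deque()

    def load(self, words):
        """Write pulse: four words loaded at once, overwriting old contents."""
        if len(words) != 4:
            raise ValueError("the register holds exactly four words")
        self.words = deque(words)  # last element sits at the right hand input

    def read(self):
        """Read pulse: output the right hand word; the rest shift right."""
        return self.words.pop()    # IndexError if the register is empty

reg = InputRegister()
reg.load(["w1", "w2", "w3", "w4"])        # w4 is at the right hand input
order = [reg.read() for _ in range(4)]    # ['w4', 'w3', 'w2', 'w1']
```

The words emerge in the reverse of their left to right loading order, which is why a processing unit reading on successive top row subcycles receives the right hand word first.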

Similar remarks apply to input to processing units P₅, P₇ and P₈. Unit P₅
requires input from port I/O₅ on four successive subcycles, whereas three
successive inputs from port I/O₈ suffice for unit P₈. Unit P₇ requires input
from port I/O₇ on four subcycles of which the first and last pairs are separated
by eight subcycles.

In practice, the input registers 60 of the processing units P₅ to P₈ are arranged
and loaded in parallel; they receive data simultaneously once per cycle. This
occurs immediately prior to processing unit P₈ computing the function of the
uppermost and extreme right location (V1 = 1, V2 = 16) in Figure 5. It
simulates the prior art systolic array 10, which receives top row inputs
simultaneously. The contents of the registers 60 are overwritten by each
successive input. As will be described later in more detail, meaningful data (ie
x_{1,2}) is first processed by unit P₈ on subcycle 30, the data having been input at
60 prior to subcycle 16. Thereafter the data remains in the registers 60 until
overwritten at the end of subcycle 30.


Output from the processor 40 via a port I/O of the kind shown in Figure 9 is
comparatively simple. A write pulse on the line 76 clocks the contents of data
bus B into the output register 62. The read output line 72 is pulsed by external
circuitry (not shown) to read the register contents on to the output bus 70. A
succeeding write pulse at 76 provides for the register contents to be overwritten.
External circuitry (not shown) is arranged to read from the output register 62 up
to five times per cycle, this being the maximum number of output values per
cycle from a single port I/O. In the present example of the invention,
processing units P₁ to P₄ only require output facilities such as output register 62.
However, it is convenient to treat all units P₁ to P₈ as having like I/O ports.






Referring to Figures 1 and 5 once more, it is useful to compare the operation of
the prior art device 10 with that of the processor 40 of the invention. The
device 10 employs signal flow progressing generally downwards and to the right,
each of the cells 12 to 18 being clock activated and operating on every clock
cycle in phase with one another. In the processor 40 of the invention, this
scheme is at least conceptually preserved from cycle to cycle. Each processing
unit P₂ etc shown in Figure 5 receives data from the equivalent of above and to
the left and outputs to the equivalent of below and to the right. In the case of
processing unit P₁ on subcycle 1 (V1 = 1), it receives from register section 42₁
"above" and memory M₀ "to the left". It subsequently provides (internally
delayed) outputs "below" and "to the right" via memory M₀, the outputs to the
right being for use on the next cycle (subcycle 16 = V2). However, within a
cycle, each of the processing units P₁ to P₈ deals with the conceptual internal
cell locations allocated to it in reverse order compared to prior art data flow.
Thus the first locations to be processed are those lying on an upper right to
lower left diagonal (V1 = 1). Locations are processed in succession upwardly
and to the left; eg processing unit P₁ executes computations corresponding to
internal cell locations in which the row and column numbers reduce alternately
by unity between successive subcycles. For units P₅ to P₈, a discontinuous shift
occurs after top row subcycles. On subcycle 1, the computations of units P₁ to
P₈ correspond to internal cell locations on an array diagonal extending to the
upper right hand corner. Unit Pᵢ on subcycle 15 is processing the internal cell
location at row (9-i), column (8 + i) (i = 1 to 7) in Figure 5. (For
comparison with Figure 1, the column number should be increased by 1 to allow
for the extra column incorporating the uppermost boundary cell 14.) On
subcycle 1, the equivalent for i = 1 to 8 is column (7 + i) with unchanged row
number (9-i).

The reason for the conceptual reversing of the order of processing internal cell
locations as indicated in Figure 5 is to ensure that intermediate computed values
stored in memories or registers are not overwritten before they are needed. For
example, referring to Figure 8 once more, on subcycle 3 processing unit P₁
overwrites the contents of address M₁0 which it read on the previous subcycle.
The new value written to address M₁0 remains there to be read and then
overwritten on the subsequent cycle fourteen subcycles later. If this procedure
were to be reversed, the contents of address M₁0 would be overwritten before
being read during a cycle. In this connection it is emphasised that each of the
processing units P₁ to P₈ employs inputs generated on the preceding cycle and
required to be unaffected by intervening computations. To avoid unwanted
overwriting of stored data without the aforesaid order reversal, it would be
necessary to provide storage (double buffering) and address and data buses
additional to that shown in Figure 4.

The conceptual reversal of the location processing order and the relative phasing
of the operations of the processing units P₁ to P₈ is implemented by the
respective list of data addresses in each processing unit as illustrated in Figure 7.
The addresses in any list are accessed in succession, and when a processing unit
reaches the end of its list it begins again. The relative phasing illustrated in
Figure 5 is implemented by assigning the processing units P₁ to P₈ appropriate
start points for their respective address lists.
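
Cyclic traversal of an address list from a chosen start point may be sketched in Python as follows. The function name, the placeholder address sets and the start index are illustrative assumptions; the actual start points are those implied by Figure 5 and are not reproduced here.

```python
from itertools import cycle, islice

def address_stream(address_list, start):
    """Yield address sets in succession from index `start`, wrapping at the end."""
    rotated = address_list[start:] + address_list[:start]
    return cycle(rotated)

# Fifteen placeholder address sets, numbered by position in the list.
sets = [f"set{k}" for k in range(1, 16)]
stream = address_stream(sets, start=13)   # hypothetical start point
first_four = list(islice(stream, 4))      # ['set14', 'set15', 'set1', 'set2']
```

Giving each processing unit the same traversal logic but a different start index is sufficient to produce a fixed relative phasing between units.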

The foregoing analysis relating to Figures 4 to 9 has not referred to the matter
of processor start-up. It was assumed implicitly that, from V1 value or subcycle
1 onwards, the processor 40 was processing data. In the prior art, as shown in
Figures 1 to 3, it takes 15 cycles after input of x_{1,1} to the topmost boundary
cell 14 for y₁ to be input on cycle 16 to the internal cell 12 in the upper right
corner. A further fourteen cycles are required for a cumulatively processed result
arising inter alia from y₁ to reach the lowermost internal cell 12. The start-up
phase for a prior art systolic array 10 consequently passes as a wavefront from
upper left down to lower right, the wavefront extending orthogonally to its
propagation direction. An equivalent start-up phase occurs in the processor 40
of the invention. The first processing unit to operate on meaningful input data
is P₈ on subcycle 30 at the top left hand corner of Figure 5. Subcycle 30 is at
the end of the second cycle during which x_{1,2} is to be input to processing unit
P₈. On this subcycle, unit P₈ is carrying out the processing task of the first
(leftmost) top row internal cell 12 shown in Figure 1, which receives successive
matrix elements of the kind x_{n,2} (n = 1, 2...). On subcycles 44 and 45, which
are in the third cycle (not shown), unit P₈ reads in x_{1,3} and x_{2,2} respectively to
carry out the functions of the first and second top row internal cells 12 of
Figure 1. This start-up phase proceeds along rows and down columns in the
Figure 5 representation. Eventually, on subcycle 437, the second subcycle of
cycle 30, processing unit P₁ receives inputs derived from x_{1,1} to x_{1,15} and y₁.
It computes a result corresponding to the first meaningful output from the
lowermost internal cell 12 in the Figure 1 processor 10. The start-up phase is
then complete. Start-up phases are well understood in the fields of systolic
arrays and digital electronics and will not be described further. It may be
desirable to make provision for ignoring or inhibiting those outputs from the
processor 40 which do not correspond to real inputs.
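
The subcycle numbers quoted in the start-up discussion can be checked against the fifteen-subcycle cycle length with a few lines of Python. The function name locate is an assumption introduced for this illustration.

```python
def locate(subcycle, per_cycle=15):
    """Return (cycle, position within cycle), both counted from 1."""
    cycle, position = divmod(subcycle - 1, per_cycle)
    return cycle + 1, position + 1

# Subcycle numbers mentioned in the text.
checks = {s: locate(s) for s in (30, 44, 45, 437)}
```

The results confirm that subcycle 30 is the last subcycle of cycle 2, that subcycles 44 and 45 fall in cycle 3, and that subcycle 437 is the second subcycle of cycle 30.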



The processor 40 requires an equivalent of the chain of boundary cells 14 and
delay latches 16 in order to operate on a data matrix X. It is necessary to
compute parameters c and s from values output by processing units P₁ and P₈,
and temporarily stored in memories M₀ and M₈ for use one cycle later in each
case. This is exemplified in Figure 8, for example on subcycle 6. On this
subcycle, unit P₁ executes WR1 to M₀17; ie address 17 in memory M₀ receives
the equivalent of a vertical output of an internal cell 12 destined for a boundary
cell 14. The memory M₀ is therefore required to be interfaced to a device
which will access M₀17, compute c and s rotation parameters as shown in Figure
2, and write c and s to MqU and M&3 respectively for use on the next cycle. 
This is to be carried out on alternate subcycles, ie each occasion that unit P₁ is
shown closely adjacent to memory M₀ in Figure 5. Similarly, memory M₈ is
required to be interfaced to a second like device arranged to access it on
alternate subcycles for computation and return of rotation parameters. This
second like device is required to receive matrix element x_{1,1} on the first cycle of
operation as indicated in Figures 1 and 3. It is also required to receive
subsequent row leading matrix elements x_{n,1} (n = 2, 3 ...). It will act as the
uppermost boundary cell 14 in Figure 1 to generate c and s rotation parameters
to be read as RE2 and RE3 by processing unit P₈ at the end of the second
cycle (V2 = 30). These devices are straightforward to implement in practice.
They will be processing devices similar to units P₁ to P₈ and interfaced to
respective memories M₀ and M₈ via the data and address buses shown truncated
in Figure 4.



The processor 40 of the invention incorporates processing units P₁ etc with
internal memory containing an address look-up table and a store for three
delayed values and fifteen coefficients in addition to a programme. It is also
possible to employ simpler processing devices with less internal memory capacity.
In this case, the memories M₀ etc might contain address lists and value and
coefficient stores, and be associated with counters for counting through address
lists of respective processing devices. It may however be less convenient to
implement and result in slower processing. This is because commercially
available discrete processing devices such as transputers incorporate sufficient
internal memory for the purposes of the processing units P₁ etc, and it would be
inefficient not to use such facilities. However, the processor 40 might well be
implemented as an integrated circuit chip or wafer in which individual processing
units, registers and memories become respective areas of silicon or gallium
arsenide. In this case the most convenient balance between local and remote
memory may be chosen.


The processor 40 is designed for the situation in which eight processing units P₁
to P₈ carry out the functions of one hundred and twenty internal cells 12. In
general, a triangular sub-array having n internal cells per (non-diagonal) outer
edge has n(n+1)/2 cells. This number may be factorised either as n/2 and (n+1)
or as n and (n+1)/2. Since n is a positive integer, one of n and (n+1) must be
an even number. Consequently, n(n+1)/2 can always be factorised to two whole
numbers, one of which may be treated as the number of processing units and
the other the number of internal cells allocated to each processing unit.
However, it may be necessary for there to be an odd number of processing
units, as opposed to the even number (eight) employed in the processor 40.
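
The factorisation of n(n+1)/2 into a number of processing units and a number of cells per unit can be expressed as a short Python function. The name allocate is hypothetical, and the choice of which factor to treat as the unit count follows the n/2 or (n+1)/2 convention discussed above.

```python
def allocate(n):
    """Split the n(n+1)/2 cells of a triangular sub-array into
    (processing units, cells per unit)."""
    total = n * (n + 1) // 2
    if n % 2 == 0:
        units, per_unit = n // 2, n + 1      # n even: n/2 units
    else:
        units, per_unit = (n + 1) // 2, n    # n odd: (n+1)/2 units
    assert units * per_unit == total         # the factorisation is exact
    return units, per_unit

units, per_unit = allocate(15)               # the processor 40: 8 units, 15 cells each
```

For n = 15 this reproduces the eight processing units and one hundred and twenty internal cells of the processor 40, with fifteen cells and hence fifteen subcycles per unit.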






Referring to Figure 10, there is shown an alternative form of processor of the
invention, this being indicated generally by 140 and incorporating an odd number
(seven) of processing units. Parts in Figure 10 equivalent to those illustrated in
Figure 4 have like reference characters P, M, D or R with asterisks. Subscript
indices are changed to run from 1 to 7 instead of 1 to 8. The processor 140 is
very similar to that described earlier, and will not be described in detail.
Instead, it is observed that the only substantial difference between the processor
140 and the earlier embodiment is that the former has no direct equivalent of
processing unit P₄. It has no direct equivalents of M₄ and R₄₁ in
consequence. Units P₄* to P₇* are in fact equivalent to units P₅ to P₈
respectively.

Figure 11 shows the relative phasing of operation of the processing units P* in
terms of V1 and V2 values as before. It is referenced equivalently to Figure 5,
and shows that the processor 140 performs the function of a 13 x 13 triangular
sub-array; ie n(n+1)/2 is 13 x 7 or 91. Each of the seven processing units P₁*
to P₇* corresponds to thirteen internal cells as shown in the drawing. There are
accordingly thirteen subcycles per cycle. In other respects, the processor 140
operates equivalently to the earlier embodiment and will not be described in
detail.



Comparison of the regular structures of Figures 4 and 11 demonstrates that the
invention may be constructed in modular form by cascading integrated circuit
chips. Each chip could contain two (or more) processing units such as P₂ and
P₈ together with their associated registers R₁₁ to R₁₄, memories M₂ and M₈ etc.
Processing units surplus to requirements on part-used chips would be bypassed.
The processors 40 and 140 each employ one more memory M₀/M₀* than there
are processing units P₁ etc. This may be accommodated by the use of an extra
external memory rather than a largely bypassed integrated circuit. Alternatively,
it is possible to omit M₀ and connect buses A₁/B₁ to M₈. This provides for
units P₁ and P₈ together with rotation parameter computing means (previously
mentioned) to address a common memory M₈. Similar remarks apply to M₀*/M₇*.
It may constitute a cumbersome alternative, since it imposes substantial access
requirements on memory M₈ or M₇*.

The foregoing discussion was directed to the use of n/2 or (n+1)/2 processing
units to carry out the function of an n x n triangular array of n(n+1)/2
processing units. This may frequently be an optimum implementation, since it
combines a substantial reduction in the number of processing units required with
a comparatively high degree of parallelism. It should be at least n/2 times faster
than a single computer carrying out the whole computation, while employing
1/(n+1) of the number of processing units required for a fully parallel array
employing one unit per node as in Figure 1. However, the invention is not
restricted to n/2 or (n+1)/2 processing units simulating an n x n triangular array.
Figure 12 illustrates appropriate phasing of operation for four processing units
simulating a 16 x 16 triangular array. V1 and V2 values up to 68 are given.
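
For an arbitrary number of processing units, the number of subcycles per cycle follows from sharing the n(n+1)/2 cells equally among the units. The Python sketch below is a hedged illustration under that equal-sharing assumption; the function name is hypothetical and the rejection of unequal splits is a choice made here, not a statement from the specification.

```python
def subcycles_per_cycle(n, units):
    """Cells handled by each unit when `units` units share n(n+1)/2 cells."""
    total = n * (n + 1) // 2
    if total % units:
        raise ValueError("cells cannot be shared equally among the units")
    return total // units

per_cycle = subcycles_per_cycle(16, 4)   # 136 cells over four units
two_cycles = 2 * per_cycle               # span of the V1 and V2 values together
```

With four units and a 16 x 16 triangular array each unit handles 34 cells, so two consecutive cycles cover subcycles 1 to 68, in agreement with the range of V1 and V2 values mentioned for Figure 12.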






A processor of the invention may be arranged to simulate both non-triangular 
systolic arrays and also arrays in which there are processing cells with differing
computational functions. Individual cells may have more than one such function;
eg a cell may switch between two computational functions on successive subcycles. 
For most purposes, however, such an arrangement might be undesirably complex. 



CLAIMS

1. A digital data processor for simulating operation of a 

parallel processing array, the processor (40) including an assembly of 
digital processing devices (P 1 to P 8 ) connected to data storing means 
(M 0 to M 8 , R u etc), characterised in that:- 

(a) each processing device (P x to P 8 ) is programmed to implement a 
respective list (parts of three lists are shown in Table 1) of sets 
of storing means data addresses (eg M x 12) ; 

(b) each address set (eg M 1 12, M 0 20, M 0 19, M 0 17, M 0 16, M 0 15) contains
input data addresses (eg M 1 12) and output data addresses (eg M 0 17)
which differ, and each such set corresponds to data input/output
functions of a respective simulated array cell (12);

(c) each list of address sets corresponds to a respective sub-array of
cells (12) of the simulated array, and each such list contains pairs
of successive address sets (eg M 1 12, M 0 20, M 0 19, M 0 17, M 0 16, M 0 15)
in which the leading address sets have input data addresses
(eg M 1 12) like to output data addresses of respective successive
address sets, each list being arranged to provide for operations
associated with simulated cells (12) to be executed in reverse order
to that corresponding to data flow (22,24) through the simulated
array; and

(d) each processing device (P 1 to P 8 ) is programmed to employ a
respective first address set (eg M 1 12, M 0 20, M 0 19, M 0 17, M 0 16, M 0 15)
to read input data from and write output data to the data storing
means (M 0 to M 8 , R 11 etc), the output data being generated in
accordance with a computational function, to employ subsequent
address sets (eg M 1 17, M 0 22, M 0 21, M 1 12, M 1 14, M 1 13) in a like
manner until the list is complete, and then to repeat this procedure
cyclically.

2. A digital processor according to Claim 1 characterised in that
each processing device (P 1 to P 8 ) is arranged to communicate with not
more than four other processing devices (P 1 to P 8 ), and incorporating
storing means including register devices (eg R 11 ) and memories (eg M 1 )
connected between respective pairs of processing devices (P 1 /P 8 , P 1 /P 2 ),
the address set lists being such that each register device (eg R 11 ) and
memory (eg M 1 ) is addressed by not more than one processing device
(eg P 1 ) at a time.

3. A digital processor according to Claim 2 characterised in that
some of the processing devices (P 2 to P 7 ) are arranged to communicate
with two of the other processing devices (P 1 to P 8 ) via respective
register devices (eg R 21 to R 24 ) and with a further two of the other
processing devices (P 1 to P 8 ) via respective memories (eg M 2 ), and
wherein the address set lists are arranged such that the register
devices (eg R 21 to R 24 ) are addressed less frequently than the memories
(eg M 2 ).

4. A digital processor according to Claim 1, 2 or 3 characterised
in that some of the processing devices (P 5 to P 8 ) include input means
(I/O) arranged for parallel to serial conversion of input data elements.

5. A digital processor according to any preceding claim
characterised in that each processing device (P 1 to P 8 ) is arranged to
store and update a respective coefficient in relation to each address
set (eg M 1 12, M 0 20, M 0 19, M 0 17, M 0 16, M 0 15) in its list.
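
By way of illustration only, the cyclic procedure of features (c) and (d) of Claim 1 may be sketched as follows. The sketch is an assumption-laden simplification and not the claimed implementation: the three-cell chain, the store name "M0", its addresses 0 to 3, the pass-through cell function and the names run_cycle and simulate are all hypothetical. It indicates why the address sets are listed in reverse order to the data flow: each simulated cell then reads the value its upstream neighbour wrote on the previous cycle, so data advances one cell per cycle as it would in the simulated array.

```python
from typing import Callable, Dict, List, Tuple

Address = Tuple[str, int]                          # (store name, location)
AddressSet = Tuple[List[Address], List[Address]]   # (input addresses, output addresses)


def run_cycle(store: Dict[Address, int],
              address_list: List[AddressSet],
              cell_fn: Callable[[List[int]], List[int]]) -> None:
    """Work through one list of address sets, one simulated cell at a time."""
    for inputs, outputs in address_list:
        values = [store[a] for a in inputs]        # read input data
        results = cell_fn(values)                  # computational function
        for addr, value in zip(outputs, results):  # write output data
            store[addr] = value


def passthrough(values: List[int]) -> List[int]:
    """Hypothetical cell function: forwards its input unchanged."""
    return list(values)


# Hypothetical chain with data flow cell A -> cell B -> cell C.
CELL_A: AddressSet = ([("M0", 0)], [("M0", 1)])
CELL_B: AddressSet = ([("M0", 1)], [("M0", 2)])
CELL_C: AddressSet = ([("M0", 2)], [("M0", 3)])
REVERSE_LIST = [CELL_C, CELL_B, CELL_A]   # reverse order to the data flow
FORWARD_LIST = [CELL_A, CELL_B, CELL_C]   # shown for contrast only


def simulate(address_list: List[AddressSet], cycles: int) -> List[int]:
    """Repeat the list cyclically and return the store contents."""
    store: Dict[Address, int] = {("M0", i): 0 for i in range(4)}
    store[("M0", 0)] = 7                  # input datum held at address 0
    for _ in range(cycles):
        run_cycle(store, address_list, passthrough)
    return [store[("M0", i)] for i in range(4)]


if __name__ == "__main__":
    print(simulate(REVERSE_LIST, 1))
    print(simulate(FORWARD_LIST, 1))
```

In this sketch the reverse-order list moves the datum one cell per cycle, reaching the last address after three cycles, whereas the forward-order list would carry it through all three cells within a single cycle, which does not correspond to the timing of the simulated array.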